1 Tidy corpus form

We’ve talked about tidy format – each variable a column, each observation a row, each type of observational unit a table. We’ve taken speeches at the Nobel prize and pages of oil company Sustainability Reports as our observations and can use this to do some transformations – do total word counts or specific word frequencies. But once we start really doing analysis at the level of the word, tidy rules would seem to dictate that we treat not individual texts but individual words as observations. Thus tidytext format is a table with one word per row.

It might help to be more specific and distinguish two terms: words and tokens. Words are unique word forms (linguists call these types). Tokens are the individual instances of a word. So the sentence “the brown fox jumped over the brown log” has 8 tokens (the total number of words) but 6 words (the number of unique words, as “the” and “brown” each appear twice).1 So tidy text format is one token per row.2
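The count can be checked quickly in base R. This is only a sketch: the variable names are ours, and splitting on single spaces is a simplification that works here because the sentence has no punctuation.

```r
# Count tokens (all word instances) and unique words in the example sentence.
sentence <- "the brown fox jumped over the brown log"
tokens <- strsplit(sentence, " ", fixed = TRUE)[[1]]  # naive split on single spaces

n_tokens <- length(tokens)          # every instance counts
n_words  <- length(unique(tokens))  # each distinct word counts once
c(tokens = n_tokens, words = n_words)
```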

We convert the character strings that we have been working with into tidy text with the unnest_tokens() function in the tidytext package, which splits texts up into tokens (a process appropriately called “tokenization”).

# Let's use the first speech in our Nobel corpus as an example
library(tidytext)
library(tidyverse)
nobel <- read_rds("data/nobel_cleaned.Rds")
ex <- nobel[1,]
ex %>%
  unnest_tokens(output = words, input = AwardSpeech)
## # A tibble: 598 x 3
##     Year Laureate           words              
##    <dbl> <chr>              <chr>              
##  1  1905 Bertha von Suttner "on"               
##  2  1905 Bertha von Suttner "behalf"           
##  3  1905 Bertha von Suttner "of"               
##  4  1905 Bertha von Suttner "the"              
##  5  1905 Bertha von Suttner "nobel"            
##  6  1905 Bertha von Suttner "committee"        
##  7  1905 Bertha von Suttner "bj\u00f8rnstjerne"
##  8  1905 Bertha von Suttner "bj\u00f8rnson"    
##  9  1905 Bertha von Suttner "introduced"       
## 10  1905 Bertha von Suttner "the"              
## # ... with 588 more rows
## # i Use `print(n = ...)` to see more rows

We now end up with a dataframe where each row is a token, with the word, year, and Nobel laureate as variables. Note also that unnest_tokens() does some cleaning by default, such as stripping punctuation and whitespace and converting to lower case. We didn’t notice it here because we’re using a dataframe that we have already cleaned. The first thing we might notice is that the number of rows is the total token count.
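To see that default cleaning at work, here is a small sketch on an uncleaned toy string. The tibble `toy` and its contents are made up for illustration; `to_lower` is an argument of unnest_tokens() that switches the lower-casing off.

```r
library(dplyr)
library(tidytext)

# Toy, uncleaned text (hypothetical), not part of the Nobel corpus.
toy <- tibble(id = 1, text = "The Nobel Committee, in 1905.")

# Default behaviour: punctuation is dropped and tokens are lower-cased.
default_tokens <- toy %>%
  unnest_tokens(output = words, input = text)

# Same tokenization, but keeping the original capitalization.
cased_tokens <- toy %>%
  unnest_tokens(output = words, input = text, to_lower = FALSE)

default_tokens
cased_tokens
```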

2 Stop Words

Another common pre-processing step, which we can now look at, is removing so-called stop words. Stop words are words with a grammatical rather than semantic function - they make what we say grammatical but don’t add (much) meaning. Examples include “a”, “the”, “of” and so on. What counts as a stop word may vary with our corpus and with the sort of analysis we’re trying to do.3 All major text analysis packages have means for removing stop words. In tidytext we have a list of stop words called stop_words, drawn from a few different lexicons (see ?stop_words for more info and links).

stop_words
## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ... with 1,139 more rows
## # i Use `print(n = ...)` to see more rows
nobel_tidy <- nobel %>%
  unnest_tokens(output = words, input = AwardSpeech) %>%
  anti_join(stop_words, by = c("words" = "word"))  # by= specifies which columns to use, had they been named the same thing we could have omitted it

stop_words, we’ll recall, is a tibble, so we can easily make our own tibble of stop words and add it. Say (for purposes of demonstration) we think that the Peace Prize award ceremony is, of course, going to talk about peace and war, so we want to weed these words out of our frequency counts.

my_words <- c("peace", "war")
custom_stop_words <- tibble(word = my_words, lexicon = "my_customization")
stop_words_custom <- rbind(stop_words, custom_stop_words)
tail(stop_words_custom) # view the end of the tibble; looks like our words were added correctly
## # A tibble: 6 x 2
##   word     lexicon         
##   <chr>    <chr>           
## 1 younger  onix            
## 2 youngest onix            
## 3 your     onix            
## 4 yours    onix            
## 5 peace    my_customization
## 6 war      my_customization

Now we can apply the stop_words_custom just like we did stop_words. We won’t actually do this because we probably want these words in the corpus!
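For illustration only, here is what applying a customized stop list looks like on toy data. Both tibbles below are hypothetical stand-ins (a few made-up tokens and a four-word stop list), not the Nobel corpus or the real stop_words.

```r
library(dplyr)

# Toy stand-ins (hypothetical data).
toy_tokens <- tibble(words = c("the", "peace", "of", "nations", "war", "treaty"))
toy_stop_words <- tibble(
  word    = c("the", "of", "peace", "war"),
  lexicon = c("SMART", "SMART", "my_customization", "my_customization")
)

# anti_join() keeps only the rows of toy_tokens with no match in the stop list.
toy_clean <- toy_tokens %>%
  anti_join(toy_stop_words, by = c("words" = "word"))
toy_clean
```

With the real objects the call has the same shape: swap in nobel's tokens and stop_words_custom.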

3 Stemming

If we are looking at word frequencies, will we want to count “represent” and “represented” as the same word? Or “war” and “wars”? Very likely. If so we can transform our text via a process called stemming, cutting words down to their stems so that different forms of these words are recognized as being the same thing. Something like this matters in English but might really matter in a more highly inflected language like Russian.

We have a couple of stemmers to choose from in R. One, and the best known, is the Porter stemming algorithm. Another is hunspell, based on the popular open source and multilingual spell checker. We’ll use Porter here.

To see it in action we’ll first test it against a short test set of words and then apply to our whole corpus.

library(SnowballC)
words_to_stem <- c("going", "represented", "wars", "similarity", "books")
SnowballC::wordStem(words_to_stem)
## [1] "go"      "repres"  "war"     "similar" "book"

That is stemming in action. Now we apply it to the entire corpus:

(nobel_tidy_stemmed <- nobel_tidy %>%
  mutate(word_stem = wordStem(words)))
## # A tibble: 94,107 x 4
##     Year Laureate           words               word_stem         
##    <dbl> <chr>              <chr>               <chr>             
##  1  1905 Bertha von Suttner "behalf"            "behalf"          
##  2  1905 Bertha von Suttner "nobel"             "nobel"           
##  3  1905 Bertha von Suttner "committee"         "committe"        
##  4  1905 Bertha von Suttner "bj\u00f8rnstjerne" "bj\u00f8rnstjern"
##  5  1905 Bertha von Suttner "bj\u00f8rnson"     "bj\u00f8rnson"   
##  6  1905 Bertha von Suttner "introduced"        "introduc"        
##  7  1905 Bertha von Suttner "speaker"           "speaker"         
##  8  1905 Bertha von Suttner "baroness"          "baro"            
##  9  1905 Bertha von Suttner "bertha"            "bertha"          
## 10  1905 Bertha von Suttner "von"               "von"             
## # ... with 94,097 more rows
## # i Use `print(n = ...)` to see more rows
write_rds(nobel_tidy_stemmed, "data/nobel_stemmed.Rds")
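Once the stems are in their own column, frequencies can be counted per stem rather than per surface form. A minimal sketch on a made-up token table (`toy` is hypothetical and stands in for nobel_tidy):

```r
library(dplyr)
library(SnowballC)

# Toy token table (hypothetical), standing in for nobel_tidy.
toy <- tibble(words = c("war", "wars", "war", "represent", "represented"))

toy_stems <- toy %>%
  mutate(word_stem = wordStem(words)) %>%  # Porter stemmer is the default
  count(word_stem, sort = TRUE)            # frequencies per stem, not per surface form
toy_stems
```

Here “war” and “wars” are pooled into a single row, which is the behaviour we were after.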

4 Top word frequencies4

Now we can find the most frequently appearing words in the corpus.

nobel_tidy %>%
  count(words, sort=TRUE)
## # A tibble: 12,329 x 2
##    words             n
##    <chr>         <int>
##  1 peace          1595
##  2 war             927
##  3 world           847
##  4 prize           646
##  5 nations         642
##  6 nobel           597
##  7 international   530
##  8 people          513
##  9 time            441
## 10 human           440
## # ... with 12,319 more rows
## # i Use `print(n = ...)` to see more rows

Remember we got this nice and informative result only because we already removed the stop words. Otherwise we would have been swamped with “the”, “and”, and so on.

Our knowledge of how to subset tibbles will now come in pretty handy. If we want to get the most frequent words, say, before 1945, we can easily do this.

nobel_tidy %>%
  filter(Year < 1945) %>%
  count(words, sort=TRUE)
## # A tibble: 4,801 x 2
##    words             n
##    <chr>         <int>
##  1 peace           346
##  2 war             301
##  3 nations         193
##  4 international   157
##  5 league          145
##  6 world           110
##  7 time             81
##  8 american         72
##  9 prize            72
## 10 europe           69
## # ... with 4,791 more rows
## # i Use `print(n = ...)` to see more rows

And we could then compare it to post-1945 word frequency and we’d see pre-WWII prize speeches were more concentrated on Europe, the League, and nations while after that we get more talk of world, people, human, committee. That makes sense.
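One way to put the two periods side by side is to count each separately and join the results. The sketch below uses a made-up token table (`toy`, hypothetical) in place of nobel_tidy; the column names n_pre and n_post are ours.

```r
library(dplyr)

# Toy token table (hypothetical), standing in for nobel_tidy.
toy <- tibble(
  Year  = c(1920, 1920, 1930, 1950, 1960, 1960),
  words = c("league", "europe", "league", "world", "world", "human")
)

pre  <- toy %>% filter(Year < 1945)  %>% count(words, sort = TRUE, name = "n_pre")
post <- toy %>% filter(Year >= 1945) %>% count(words, sort = TRUE, name = "n_post")

# full_join() keeps words that occur in only one period; missing counts become 0.
comparison <- full_join(pre, post, by = "words") %>%
  mutate(across(c(n_pre, n_post), ~ coalesce(.x, 0L)))
comparison
```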

Let’s graph this.

nobel_tidy %>%
  count(words, sort=TRUE) %>%
  top_n(15) %>%                     # selecting to show only top 15 words
  mutate(words = reorder(words,desc(n))) %>%  # this will ensure that the highest frequency words appear to the left
  ggplot(aes(words, n)) +
    geom_col()

And with just a little bit more code we can view pre-1945 and post-1945 top frequency words at the same time.

nobel_tidy %>%
  mutate(Period = ifelse(Year <= 1945, "Pre-WWII", "Post-WWII")) %>%   # label each row "Pre-WWII" or "Post-WWII"
  mutate(Period = factor(Period, levels = c("Pre-WWII", "Post-WWII"))) %>%
  group_by(Period) %>%                           # group by this label so frequencies are calculated within each group
  count(words, sort=TRUE) %>%
  mutate(proportion = n / sum(n) * 1000) %>%     # word frequency per 1000 words rather than raw counts
  slice_max(order_by=proportion, n = 15) %>%     # show only the top 15 words within each group
  ggplot(aes(reorder_within(x = words, by = proportion, within = Period), proportion, fill = Period)) +    # reordering within facets is a bit tricky, see ?reorder_within()
    geom_col() +
    scale_x_reordered() +
    coord_flip() +
    facet_wrap(~Period, ncol = 2, scales = "free") +
  xlab("Word")

4.1 Word Clouds

One common visualization of word frequency is the word cloud. To make one we use the wordcloud package, which works very nicely with our tidily organized data. The wordcloud2 package gives color and fancier options that you can also play with.

library(wordcloud)
library(wordcloud2)
nobel_tidy %>%
  count(words, sort=TRUE) %>%
  with(wordcloud(words, n, max.words = 100))

dat <- nobel_tidy %>%
  count(words, sort=TRUE) %>%
  mutate(word = words) %>%
  mutate(freq = n) %>%
  select(word, freq) %>%
  top_n(200)
wordcloud2(dat, size = 2)

It’s also possible to do word clouds that compare two documents. To do this we’ll need to step outside the tidyverse and organize our data in other formats so we save this for Session 5.

5 TF-IDF

As we’ve seen, there are multiple ways to calculate frequency. We can take raw counts, or term frequency (\(tf\)). Proportions, term frequency divided by the total token count of a given document, are another option, and they certainly make it easier to compare across corpora of different sizes. The problem with both is that the top of the list tends to get flooded with stop words. One means of dealing with this is to remove the stop words as we have done, but another is to downweight words that appear often everywhere and upweight those that are more unusual. Inverse document frequency (\(idf\)) is a weighting system to do this – it is based on the total number of documents in the corpus divided by the number of documents in the corpus that contain the given word. The fewer documents a word appears in (suggesting a word that is specific to certain documents rather than widespread across the corpus as a whole), the smaller the denominator and, thus, the greater the ratio.

TF-IDF is term frequency times inverse document frequency. Both are often logged. In symbols,

\[\begin{equation} \text{TF}_{t,d} = \begin{cases} 1 + \text{log}_{10} \: \text{count(t,d)}, & \text{if count(t,d)} > 0 \\ 0,&\text{otherwise.}\\ \end{cases} \end{equation}\]

\[\begin{equation} \text{IDF}_{t} = \text{log}_{10} \: \bigg(\frac{N}{\text{df}_t}\bigg). \end{equation}\]

where t is the given term, d is a given document, df\(_t\) is the number of documents in the corpus containing term t, and N is the total number of documents in the corpus. Long story short, tf-idf attempts to weight words by both frequency in an individual document and their unusualness over a corpus of documents. Every word in every document will have its own tf-idf (term frequency will vary across documents while inverse document frequency is the same across the corpus).
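The two formulas above can be worked through by hand on a tiny example. The count matrix below is made up for illustration (two terms, two documents); everything else follows the logged definitions just given.

```r
# Toy counts (hypothetical): rows are terms, columns are documents.
counts <- matrix(
  c(10, 10,   # "the" appears in both documents
     0,  3),  # "nuclear" appears only in doc2
  nrow = 2, byrow = TRUE,
  dimnames = list(c("the", "nuclear"), c("doc1", "doc2"))
)

# TF: 1 + log10(count) when the term occurs, 0 otherwise.
tf <- ifelse(counts > 0, 1 + log10(counts), 0)

# IDF: log10(N / df), with df the number of documents containing the term.
N   <- ncol(counts)
df  <- rowSums(counts > 0)
idf <- log10(N / df)

# Each row (term) is multiplied by that term's idf.
tf_idf <- tf * idf
tf_idf
```

The term that occurs in every document gets idf = log10(2/2) = 0 and so a tf-idf of zero, however frequent it is.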

Let’s see how we do this in R. First we’ll compute how often each term occurs in each document. To simplify the results, let’s use the same split of our data into pre-1945 and post-1945 – this means we’re treating the pre-WWII speeches as one single document and likewise the post-1945 speeches. tidytext makes this pretty easy – simply unnest the tokens and then count them. Note that count() can count within groups: by passing it Period alongside words we tell it to do the counts within the groups denoted in column Period. This produces the same counts as adding group_by(Period) on the previous line and dropping Period from the count() call.

nobel <- read_rds("data/nobel_cleaned.Rds") %>%
  mutate(Period = ifelse(Year <= 1945, "Pre-WWII", "Post-WWII"))
nobel_words <- nobel %>%
  unnest_tokens(words, AwardSpeech) %>%
  count(words, Period, sort = TRUE)

We can then use bind_tf_idf() from tidytext. (tidytext implements tf-idf using proportional, but not logged, tf – we’ll see some of these other versions in other packages.) After the tidy dataframe itself, the function takes three arguments: the column containing the words, the column identifying the document, and the column containing the document-term counts.

tf_idf <- nobel_words %>%
  bind_tf_idf(words, Period, n)
tf_idf
## # A tibble: 17,036 x 6
##    words Period        n     tf   idf tf_idf
##    <chr> <chr>     <int>  <dbl> <dbl>  <dbl>
##  1 the   Post-WWII 13947 0.0742     0      0
##  2 of    Post-WWII  7281 0.0387     0      0
##  3 in    Post-WWII  5614 0.0299     0      0
##  4 to    Post-WWII  5591 0.0297     0      0
##  5 and   Post-WWII  5417 0.0288     0      0
##  6 a     Post-WWII  3712 0.0197     0      0
##  7 the   Pre-WWII   3526 0.0761     0      0
##  8 that  Post-WWII  2543 0.0135     0      0
##  9 is    Post-WWII  2329 0.0124     0      0
## 10 of    Pre-WWII   2191 0.0473     0      0
## # ... with 17,026 more rows
## # i Use `print(n = ...)` to see more rows
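The zeros in the idf column follow directly from the arithmetic. tidytext takes the natural log of N/df, so with only two documents a word found in both periods gets ln(2/2) = 0, while a word found in just one gets ln(2/1) ≈ 0.693. The sketch below redoes the calculation by hand in base R on made-up counts (`toy` is hypothetical, not the Nobel data).

```r
# Toy term counts (hypothetical) for a two-document corpus.
toy <- data.frame(
  words  = c("the", "the", "nuclear", "locarno"),
  Period = c("Pre-WWII", "Post-WWII", "Post-WWII", "Pre-WWII"),
  n      = c(80, 90, 10, 20)
)

n_docs <- length(unique(toy$Period))

# Proportional tf: the term's share of all tokens in its document.
toy$tf <- toy$n / ave(toy$n, toy$Period, FUN = sum)

# Document frequency: one row per word-document pair, so a table of words counts documents.
doc_freq <- table(toy$words)
toy$idf  <- log(n_docs / as.numeric(doc_freq[toy$words]))  # natural log

toy$tf_idf <- toy$tf * toy$idf
toy
```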

Here we see that tf-idf has zeroed out these extremely common and not very interesting terms, precisely what we’d hope an indicator like this would do. Let’s see the highest tf-idf scores.

tf_idf %>% arrange(desc(tf_idf)) 
## # A tibble: 17,036 x 6
##    words    Period        n       tf   idf   tf_idf
##    <chr>    <chr>     <int>    <dbl> <dbl>    <dbl>
##  1 nuclear  Post-WWII   278 0.00148  0.693 0.00102 
##  2 saavedra Pre-WWII     22 0.000475 0.693 0.000329
##  3 locarno  Pre-WWII     20 0.000432 0.693 0.000299
##  4 angell   Pre-WWII     19 0.000410 0.693 0.000284
##  5 pan      Pre-WWII     19 0.000410 0.693 0.000284
##  6 ladies   Post-WWII    71 0.000378 0.693 0.000262
##  7 global   Post-WWII    60 0.000319 0.693 0.000221
##  8 non      Post-WWII    54 0.000287 0.693 0.000199
##  9 poverty  Post-WWII    52 0.000277 0.693 0.000192
## 10 persons  Post-WWII    51 0.000271 0.693 0.000188
## # ... with 17,026 more rows
## # i Use `print(n = ...)` to see more rows
tf_idf %>%
  mutate(Period = factor(Period, levels = c("Pre-WWII", "Post-WWII"))) %>%
  group_by(Period) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(words, tf_idf), fill = Period)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Period, ncol = 2, scales = "free") +
  labs(x = "tf-idf", y = NULL)

Many of these are names, logically enough, as the names of specific laureates appear mostly only when they are being awarded a prize, which raises their idf and upweights their tf-idf. To make this more useful we’d want to go through and weed out (remove, just like stop words) names. But still we see “nuclear” coming to the fore in the post-war era, reasonably enough! Words like “global” and “poverty” stand out post-war while pre-war we see “reparations”, “commercial”, and more Europe-specific vocabulary. A different conversation pre- and post-war.

6 POS

We have now calculated lists of words appearing with top frequency and also seen how to calculate and graph words or phrases of interest across documents. But what if we were interested in seeing the highest frequency words within a larger set of words? If we were interested in a limited set of, say, geographical names or people we could just make the list, do a join to remove all other words, and calculate top frequencies among this subset of words. But what if we were interested in, say, the most frequently occurring verbs?

For this and many other reasons linguists have long been interested in writing algorithms to automatically identify parts of speech, a task called part-of-speech tagging or POS tagging. A POS tagger would take a sentence such as “The brown fox jumped over the brown log” and return “The_(article) brown_(adjective) fox_(noun) jumped_(verb) over_(preposition)” and so on.

One way we could imagine going about this is with a dictionary that identified nouns, verbs, etc. but this is clearly naive – many a word can be both noun and verb and much will depend on context. So we’ll use the openNLP package and write this function5

library(NLP)
library(tm)  # load before openNLP
library(openNLP)
library(openNLPmodels.en)

POStagger <- function(text, filter = NULL){
  # defining annotators 
  sent_token_annotator <- Maxent_Sent_Token_Annotator()
  word_token_annotator <- Maxent_Word_Token_Annotator()
  pos_tag_annotator <- Maxent_POS_Tag_Annotator()
  # make sure input is a string
  text <- as.String(text)
  # defining a pipeline
  annotate_text <- NLP::annotate(text, Annotator_Pipeline(
    sent_token_annotator,
    word_token_annotator,
    pos_tag_annotator
  ))
  # generate tags
  tags <- sapply(subset(annotate_text, type=="word")$features, `[[`, "POS")
  # get tokens
  tokens <- text[subset(annotate_text, type=="word")]
  # filter out just one pos type
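  # if a filter was given, return only the tokens whose tag is in it
  if (!is.null(filter)) {
    tagged <- tokens[tags %in% filter]
  } else {
    # otherwise put tokens and POS tags in a tibble, dropping punctuation tags
    # (dplyr::filter is spelled out because the argument is also named `filter`)
    tagged <- tibble(word = tokens, pos = tags) %>%
      dplyr::filter(!str_detect(pos, pattern = '[[:punct:]]'))
  }
  return(tagged)
}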

nobel <- read_rds("data/nobel_cleaned.Rds")
POStagger(nobel$AwardSpeech[1])  # the speech text is in the AwardSpeech column

This yields a nice dataframe with tokens in the first column and POS tags in the second. The tags are not just “noun”, “verb” and so on but more finely differentiated. See here for a list of abbreviations. Using the same filter, group_by, etc. methods we can now find lists of the most frequent words by part of speech. There are, of course, other applications we might imagine when POS tagging is used in conjunction with other topics discussed in the workshop.

You don’t really need to worry about what’s inside the function, only that it takes a character string (preferably cleaned, as always, to limit the chance of unexpected errors). Note, too, that the function has an optional filter argument; when it is set, the function returns a character vector of just the words with the given POS tag(s).
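As a sketch of the “most frequent words by part of speech” idea, the example below works on a hand-made table in the word/pos shape described above rather than on real tagger output, so it runs without openNLP. The tibble `tagged` is hypothetical, and we assume the tags follow the Penn Treebank scheme, in which all verb tags start with “VB”.

```r
library(dplyr)
library(stringr)

# Hand-made tagger output (hypothetical), in the word/pos shape POStagger() returns.
tagged <- tibble(
  word = c("the", "fox", "jumped", "over", "the", "log", "and", "jumped", "again"),
  pos  = c("DT",  "NN",  "VBD",    "IN",   "DT",  "NN",  "CC",  "VBD",    "RB")
)

# Keep only verbs (VB, VBD, VBG, VBN, VBP, VBZ) and count them.
top_verbs <- tagged %>%
  filter(str_detect(pos, "^VB")) %>%
  count(word, sort = TRUE)
top_verbs
```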

7 Sentiment analysis

Another commonly done sort of analysis that we can easily incorporate into our tidy workflow is that of sentiment analysis. There are numerous ways computational linguists have developed to algorithmically determine the “sentiment” of a piece of text. We focus here on a simple (and fairly naive) method that works on the level of individual words and employs a sentiment dictionary, essentially a list of words and their associated sentiments. As we will see, these same dictionary methods can also be applied to any list or lists of words and thus give analysts a more flexible tool to track vocabulary across a corpus.

tidytext includes three dictionaries, each working slightly differently.6 Let’s take a look below.

library(tidytext)
# the first time you will need to say yes to download of the sentiment dictionary
get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
## # i Use `print(n = ...)` to see more rows
tail(get_sentiments("afinn"))
## # A tibble: 6 x 2
##   word     value
##   <chr>    <dbl>
## 1 youthful     2
## 2 yucky       -2
## 3 yummy        3
## 4 zealot      -2
## 5 zealots     -2
## 6 zealous      2

Here we have negative or positive sentiment (words beginning with “ab-” seem to be quite negative!). If we want to get a quick overview of the scale we can call either table() or summary()

table(get_sentiments("afinn")$value) # for categorical data, tells us categories and n of those categories
## 
##  -5  -4  -3  -2  -1   0   1   2   3   4   5 
##  16  43 264 966 309   1 208 448 172  45   5
summary(get_sentiments("afinn")$value) # summary statistics for the value column 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5.0000 -2.0000 -2.0000 -0.5894  2.0000  5.0000

So we see the value of sentiments goes from -5 to 5. Calling the other two:

get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
## # i Use `print(n = ...)` to see more rows
get_sentiments("nrc")
## # A tibble: 13,872 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,862 more rows
## # i Use `print(n = ...)` to see more rows

“Bing” labels words simply negative or positive and the “nrc” lexicon labels them according to a handful of main emotions. You can take a look at the tidytext documentation for more background on the lexicons.7

So what do we do with these? The simplest method is simply to count words from the lexicons that appear in a text and add them all up. This is what is known as a dictionary method, which we’ll talk about in a little more depth below. From a practical standpoint, how do we add up all dictionary words in a text? Tidy makes this pretty easy. You’ll see the code is a little bit different based on the kind of dictionary.

For the bing dictionary we will tokenize our corpus and do an inner join8 with the sentiment dictionary. This adds a new column to our Nobel corpus dataframe holding the sentiment associated with each corpus word, and it eliminates corpus words that have no sentiment associated with them in the bing dictionary. We then count the positive and negative words in each year and do a pivot_wider, the opposite of the pivot_longer we did in the previous session when graphing n-grams. pivot_wider makes several columns out of fewer – here it takes the values of the sentiment column (“positive” and “negative”), turns them into column names, and fills those columns with the values from the associated rows of our n column. Take a look at what the dataframe looks like before and after the transformation to make sure you understand what is happening. We do this pivot in order to be able to do basic arithmetic – take the number of positive words and subtract the number of negative words, for every year in the corpus. This will be our sentiment “score”. It is then a fairly straightforward path to graphing it with ggplot.

library(tidyverse)
nobel <- read_rds("data/nobel_cleaned.Rds")
# calculating text sentiment by subtracting the count of negative words from the count of positive words, using the bing lexicon
nobel %>%
  unnest_tokens(word, AwardSpeech) %>%  ## we call our new column "word" which makes inner_joins easier 
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, year = Year) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(year, sentiment)) +
           geom_line(show.legend = FALSE) +
           geom_hline(yintercept = 0, linetype = 2, alpha = .8)
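To see what the pivot does, here is the same pipeline on toy data, stopping before and after pivot_wider(). Both the tokens and the three-word lexicon are made up for illustration and are not the bing lexicon; the column name `score` is ours.

```r
library(dplyr)
library(tidyr)

# Toy tokens and a toy lexicon (both hypothetical).
toy_tokens <- tibble(
  Year = c(1905, 1905, 1905, 1906, 1906),
  word = c("hope", "war", "hope", "war", "fear")
)
toy_lexicon <- tibble(
  word      = c("hope", "war", "fear"),
  sentiment = c("positive", "negative", "negative")
)

# Before the pivot: one row per sentiment per year.
long <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(sentiment, Year)
long

# After the pivot: one row per year, one column per sentiment.
wide <- long %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(score = positive - negative)
wide
```

Note that 1906 has no positive words at all; values_fill = 0 is what turns that missing cell into a zero so the subtraction still works.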

With the afinn lexicon we have numeric “sentiment values” attached to words, so we sum the values of all matched words for each year.

# using the afinn lexicon
nobel %>%
  unnest_tokens(word, AwardSpeech) %>%  ## we call our new column "word" which makes inner_joins easier 
  inner_join(get_sentiments("afinn")) %>%
  group_by(Year) %>%
  summarize(sentiment = sum(value)) %>%
    ggplot(aes(Year, sentiment)) +
    geom_line(show.legend = FALSE) +
    geom_hline(yintercept = 0, linetype = 2, alpha = .8)

We can also do this across different categories over the years. For instance, say we’d like the same kind of chart for each oil company’s sustainability reports.

sr <- srps <- read_csv("data/srps.csv")
sr %>%
  unnest_tokens(word, Text) %>%  ## we call our new column "word" which makes inner_joins easier 
  inner_join(get_sentiments("afinn")) %>%
  group_by(Company, Year) %>%
  summarize(sentiment = sum(value)) %>%
  ggplot(aes(Year, sentiment)) +
    geom_line(show.legend = FALSE) +
    geom_hline(yintercept = 0, linetype = 2, alpha = .8) +
    facet_wrap(~Company, ncol = 2, scales = "free_x")

We can also use the sentiment lexicons in ways that mostly combine what we have already done in subsetting dataframes and counting words. Say we’d like to see which words in a certain category of documents are associated with negative anticipation in the pre-WWII vs post-WWII Nobel prize award speeches. We’ll do this in several steps.

First we get a dictionary of the words associated with anticipation. Then do the same with negative sentiment words. Then we again do an inner_join() that creates a new dataframe that includes only those words present in both anticipation and negative.

anticipation <- get_sentiments("nrc") %>% 
  filter(sentiment == "anticipation")
negative <- get_sentiments("nrc") %>%
  filter(sentiment == "negative")
negAnt <- inner_join(anticipation, negative, by="word") %>%
  select(word)        # we don't really need the other two columns which tell us that they are negative and anticipation words

Now we make our Nobel data “tidy” (one-token-per-row) and do another inner_join() with our negative anticipation words. What we have left are all negative anticipation words in the corpus, which we can then print.

nobel %>%
  mutate(Period = ifelse(Year <= 1945, "Pre-WWII", "Post-WWII")) %>%   # same split as before: 1945 and earlier is "Pre-WWII"
  mutate(Period = factor(Period, levels = c("Pre-WWII", "Post-WWII"))) %>% # fix the order of the facets
  unnest_tokens(word, AwardSpeech) %>%
  inner_join(negAnt) %>%
  group_by(Period) %>%
  count(word, sort = TRUE) %>%
  slice_max(order_by=n, n = 15) %>%                     # selecting to show only top 15 words within each group
  ggplot(aes(reorder_within(x = word, by = n, within = Period), n, fill = Period)) +    # reordering is a bit tricky, see ?reorder_within()
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~Period, ncol = 2, scales = "free") +
  theme_bw() +
  xlab('Words')

Wait, “mother” is a word associated with both negativity and anticipation?

get_sentiments("nrc") %>%
  filter(word == "mother")
## # A tibble: 6 x 2
##   word   sentiment   
##   <chr>  <chr>       
## 1 mother anticipation
## 2 mother joy         
## 3 mother negative    
## 4 mother positive    
## 5 mother sadness     
## 6 mother trust

Apparently so, according to this lexicon. Which offers us a nice segue into a larger discussion of dictionaries.

8 Dictionary methods

It is entirely obvious to historians that the meaning of words – not to mention latent sentiments – changes over time, place, and context, even within individual documents. So these methods have to be used with care. To take one instance, economists have noticed that words like “liability” carry very different sentiment and connotations in finance than in other areas of life (Loughran and McDonald (2011)). Thus it might behoove us to create our own dictionaries, and this is something we could easily do. In this case, we (as domain experts) could compile our own dictionaries not only to judge sentiment, but also to measure the attention given to certain topics or the tone of discussion.

Here really the only challenge is to read a dictionary into R and then use it per the above methods. Say that we have a list of words in a text file, not comma-separated, that is a dictionary for identifying when the theme of oil is brought up in a text (this is something I came up with in literally 5 seconds, please do not use this for anything – actual dictionaries should be considerably better thought out, as well as verified on real texts that it’s catching what you want it to). Each word is on its own separate line (coincidentally this is the format of the above cited Norwegian sentiment dictionary, so should you want to use that you can adopt the process here.)

oil_dict <- read_lines("data/oil_theme.txt")
oil_dict <- tibble(word = oil_dict, dictionary = "oil dictionary") %>%
  filter(word != "")   # drop the empty string read from the blank last line of the file
oil_dict
## # A tibble: 6 x 2
##   word      dictionary    
##   <chr>     <chr>         
## 1 oil       oil dictionary
## 2 petrol    oil dictionary
## 3 petroleum oil dictionary
## 4 gas       oil dictionary
## 5 tanker    oil dictionary
## 6 fossil    oil dictionary

In just a few lines of code we’ve made it into a tibble. The file ends with a blank line, which is read in as an empty string, so we filter that out, and then we’re all set to use this dictionary. Now if we want to add up every time a text uses a word in this dictionary we can easily do this.
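A minimal sketch of that adding-up, on toy data: the tokenized “reports” and the three-word dictionary below are hypothetical, and the column names total_words, oil_words and per_1000 are ours. A left join flags dictionary hits while keeping every token, so we can also scale by document length.

```r
library(dplyr)

# Toy tokenized reports and a toy dictionary (both hypothetical).
toy_tokens <- tibble(
  Company = c("A", "A", "A", "A", "B", "B"),
  word    = c("oil", "gas", "climate", "oil", "wind", "tanker")
)
toy_dict <- tibble(word = c("oil", "gas", "tanker"), dictionary = "oil dictionary")

# Flag dictionary hits, then count hits and scale by the number of tokens.
oil_share <- toy_tokens %>%
  left_join(toy_dict, by = "word") %>%
  group_by(Company) %>%
  summarize(
    total_words = n(),
    oil_words   = sum(!is.na(dictionary)),
    per_1000    = oil_words / total_words * 1000
  )
oil_share
```

If only the raw number of hits is needed, an inner_join() followed by count() does the same job in fewer steps.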

8.1 Exercises

  • Compile your own dictionary of interest and use it to locate a theme of interest in the Nobel corpus.
  • Use our very informal oil dictionary and apply it to the Sustainability Reports. You might also scale by the total number of words per report – are oil terms a smaller proportion of sustainability reports than they were 10 years ago?

9 Multiple word analysis

9.1 N-grams

In Google’s famous n-grams viewer, you can search not just for a single word but for phrases, sets of words “n” tokens long. Indeed, this is why it is called an “n”-gram: n-grams are consecutive sequences of an arbitrary number of words. So far we’ve only been looking at individual words, so let’s think about multi-word units.
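To make “consecutive sequences” concrete, here is a base-R sketch that builds the 2-grams of a short token vector by pairing each token with its successor. The vector is typed in by hand for illustration.

```r
# Build all 2-grams from a tokenized phrase by pairing each token with the next one.
tokens  <- c("on", "behalf", "of", "the", "nobel", "committee")
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
bigrams
```

A sequence of k tokens always yields k - 1 bigrams, since every token except the last starts one.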

True to form, the tidyverse will help out with this, allowing us to unnest_tokens() by the n-gram by telling R that our token of interest is now not the single word but an ngram of n length.

nobel %>%
  unnest_tokens(twogram, AwardSpeech, token = "ngrams", n = 2)
## # A tibble: 234,267 x 3
##     Year Laureate           twogram                          
##    <dbl> <chr>              <chr>                            
##  1  1905 Bertha von Suttner "on behalf"                      
##  2  1905 Bertha von Suttner "behalf of"                      
##  3  1905 Bertha von Suttner "of the"                         
##  4  1905 Bertha von Suttner "the nobel"                      
##  5  1905 Bertha von Suttner "nobel committee"                
##  6  1905 Bertha von Suttner "committee bj\u00f8rnstjerne"    
##  7  1905 Bertha von Suttner "bj\u00f8rnstjerne bj\u00f8rnson"
##  8  1905 Bertha von Suttner "bj\u00f8rnson introduced"       
##  9  1905 Bertha von Suttner "introduced the"                 
## 10  1905 Bertha von Suttner "the speaker"                    
## # ... with 234,257 more rows
## # i Use `print(n = ...)` to see more rows
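
To see what the n argument does, here is a small sketch on the example sentence from the start of this chapter rather than on the Nobel corpus. The object names toy, bigrams and trigrams are made up for illustration.

```r
library(dplyr)
library(tidytext)

# A one-row toy "corpus" (hypothetical, for illustration only)
toy <- tibble(id = 1, text = "the brown fox jumped over the brown log")

# The same text tokenized into overlapping 2-grams and 3-grams
bigrams  <- toy %>% unnest_tokens(ngram, text, token = "ngrams", n = 2)
trigrams <- toy %>% unnest_tokens(ngram, text, token = "ngrams", n = 3)

nrow(bigrams)   # 8 tokens give 7 overlapping 2-grams
nrow(trigrams)  # and 6 overlapping 3-grams
```

The n-grams overlap: each token except the first and last ends one 2-gram and starts the next, which is why the row count barely drops as n grows.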

We could then make a plot of the most frequently occurring 2-grams. The problem is that we would be overrun with stopwords, and now that we have bigrams, how do we take out the individual stop words? We could search and remove them with str_remove_all() without unnesting words; we could unnest, take out the stop words, renest and then unnest by 2-grams; but we can also use a more tidy approach with separate(), another handy command for manipulating tidyverse dataframes.

nobel %>%
  unnest_tokens(twogram, AwardSpeech, token = "ngrams", n = 2) %>%
  # separate() takes the column to split, the names of the new columns, and the separator (here a space)
  separate(twogram, into = c("word1", "word2"), sep = " ")
## # A tibble: 234,267 x 4
##     Year Laureate           word1               word2              
##    <dbl> <chr>              <chr>               <chr>              
##  1  1905 Bertha von Suttner "on"                "behalf"           
##  2  1905 Bertha von Suttner "behalf"            "of"               
##  3  1905 Bertha von Suttner "of"                "the"              
##  4  1905 Bertha von Suttner "the"               "nobel"            
##  5  1905 Bertha von Suttner "nobel"             "committee"        
##  6  1905 Bertha von Suttner "committee"         "bj\u00f8rnstjerne"
##  7  1905 Bertha von Suttner "bj\u00f8rnstjerne" "bj\u00f8rnson"    
##  8  1905 Bertha von Suttner "bj\u00f8rnson"     "introduced"       
##  9  1905 Bertha von Suttner "introduced"        "the"              
## 10  1905 Bertha von Suttner "the"               "speaker"          
## # ... with 234,257 more rows
## # i Use `print(n = ...)` to see more rows

Now we can do our stopword work. We could do an anti_join on word1 and then on word2, but another nifty thing we can do with tidy commands is use the %in% operator, which tests whether each value appears in a given vector (much like the IN operator in SQL, some of whose syntax the tidyverse borrows).

nob <- nobel %>%
  mutate(Period = ifelse(Year <= 1945, "Pre-WWII", "Post-WWII")) %>% 
  unnest_tokens(twogram, AwardSpeech, token = "ngrams", n = 2) %>%
  separate(twogram, into=c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(twogram, word1, word2, sep = ' ') %>%              # now putting word1 and word2 back into a single column called twogram
  group_by(Period) %>%
  count(twogram, sort=TRUE) %>%
  slice_max(order_by=n, n = 10)  

nob %>%
  ungroup() %>%
  # reorder_within() orders the bars separately inside each facet
  ggplot(aes(reorder_within(x = twogram, by = n, within = Period), n, fill = Period)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~Period, ncol = 2, scales = "free") +
  ylab("n") +
  xlab("2-gram")
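
For comparison, the anti_join route mentioned above might look roughly like this. It is a sketch on a made-up sentence, and the names toy and toy_bigrams are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# A one-row toy "corpus" (hypothetical, for illustration only)
toy <- tibble(id = 1, text = "the committee of the nobel prize honours lasting peace")

toy_bigrams <- toy %>%
  unnest_tokens(twogram, text, token = "ngrams", n = 2) %>%
  separate(twogram, into = c("word1", "word2"), sep = " ") %>%
  # drop rows whose first word is a stop word, then rows whose second word is one
  anti_join(stop_words, by = c("word1" = "word")) %>%
  anti_join(stop_words, by = c("word2" = "word")) %>%
  unite(twogram, word1, word2, sep = " ")

toy_bigrams
```

Both approaches drop any 2-gram containing a stop word. The %in% version is a little shorter because it does not need the by mapping between differently named columns.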

9.2 Co-occurrence

Words can also co-occur in the same context even if they are not right next to each other. To get at this we’ll need more than n-grams. One way to do this is the widyr package, which essentially restructures our tidy data, does an operation, and then recasts it back into a tidy format.9

For instance, we can use the pairwise_count() function to find all the words that appear together in the same Nobel speech (not necessarily right next to each other).

library(widyr)
nobel %>%
  unnest_tokens(word, AwardSpeech) %>%
  filter(!word %in% stop_words$word) %>%
  pairwise_count(word, Year, sort = TRUE)
## # A tibble: 22,374,876 x 3
##    item1 item2     n
##    <chr> <chr> <dbl>
##  1 time  peace    89
##  2 peace time     89
##  3 prize peace    86
##  4 peace prize    86
##  5 peace nobel    85
##  6 nobel peace    85
##  7 war   peace    85
##  8 world peace    85
##  9 peace war      85
## 10 peace world    85
## # ... with 22,374,866 more rows
## # i Use `print(n = ...)` to see more rows

Notice that this is a much larger dataframe: our unnested dataframe minus stopwords was ~93,000 words, while this new one has more than 22 million rows.
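
Part of that size comes from every pair being listed in both orders, as the output above shows (time/peace and peace/time). A small sketch on made-up data illustrates this, using the upper argument that pairwise_count() passes along to keep only one ordering. The names toy_words, both_orders and one_order are hypothetical.

```r
library(dplyr)
library(widyr)

# Already-tokenized toy data: one word per row, with a document id (hypothetical)
toy_words <- tibble(
  doc  = c(1, 1, 1, 2, 2),
  word = c("peace", "war", "prize", "peace", "war")
)

# Default: each pair appears twice, once in each order
both_orders <- toy_words %>% pairwise_count(word, doc, sort = TRUE)

# upper = FALSE keeps just one row per pair
one_order <- toy_words %>% pairwise_count(word, doc, sort = TRUE, upper = FALSE)

nrow(both_orders)  # 3 distinct pairs, listed in both orders
nrow(one_order)    # the same 3 pairs, listed once
```

Keeping both orders is convenient when you later filter on item1 for a keyword, since the keyword is then guaranteed to show up in that column.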

From here we might do many things using the same skills of subsetting, counting, and plotting we have talked about up to now. For instance, we could filter for words co-occurring with “evil” and plot those that most frequently occur in the same document as “evil”. We might also note that unnest_tokens() can also tokenize by sentence. This means that if you wanted to look at co-occurrence within sentences, you could tokenize by sentence, give each sentence (which would occupy one row) an index number (something like mutate(index = row_number()) might do the trick), then unnest those sentences into words and call pairwise_count() with the index as the feature.
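
The keyword filter could be sketched as follows, at the document level and on three made-up one-sentence documents rather than the Nobel speeches. The names toy and evil_pairs are hypothetical.

```r
library(dplyr)
library(tidytext)
library(widyr)

# Three toy documents (hypothetical, for illustration only)
toy <- tibble(
  doc  = 1:3,
  text = c("Evil thrives when peace fails.",
           "Peace defeats evil.",
           "Peace needs courage.")
)

evil_pairs <- toy %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  pairwise_count(word, doc, sort = TRUE) %>%
  # keep only the pairs whose first item is the keyword
  filter(item1 == "evil")

evil_pairs
```

From here, the result can go to ggplot() in the same way as the 2-gram counts above, with item2 on one axis and n on the other.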

9.2.1 Exercises

  • Implement the above suggestion – tokenize by sentence and search for the words that most frequently co-occur with keywords of your choice. Try this for either the Nobel or the SR corpus (NB: the SR dataframe is structured with one page of text per row, so you could also look for co-occurrence within individual pages of the Sustainability Reports).
  • Find the most frequently occurring n-grams (of whatever size n) in the SR corpus. Any surprises?

References

Clark, Michael. 2018. “An Introduction to Text Processing and Analysis with R.” 2018. https://m-clark.github.io/text-analysis-with-R/.
Jurafsky, Dan, and James H. Martin. forthcoming. “Speech and Language Processing. Vol. 3.” London: Pearson. http://www.web.stanford.edu/~jurafsky/slp3/.
Loughran, Tim, and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66 (1): 35–65.
Mosteller, Frederick, and David L. Wallace. 1963. “Inference in an Authorship Problem.” Journal of the American Statistical Association 58 (302): 275. https://doi.org/10.2307/2283270.
Niekler, Andreas, and Gregor Wiedemann. 2020. “Text Mining in R for the Social Sciences and Digital Humanities.” 2020. https://tm4ss.github.io/docs/index.html.
Schweinberger, Martin. 2021. “POS-Tagging and Syntactic Parsing with R.” 2021. https://slcladal.github.io/tagging.html#2_POS-Tagging_with_openNLP.
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media, Inc. https://www.tidytextmining.com/.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc. https://r4ds.had.co.nz/index.html.

  1. We could also start talking about tokens and words being not just things we’d recognize as words but things like :) and so on.↩︎

  2. We will also later talk about n-grams where we might take tokens to be 2-grams, sentences or even longer pieces of text.↩︎

  3. Interestingly, one of the most famous cases of computational text analysis in the social sciences – Mosteller and Wallace’s attribution of anonymously published Federalist papers – analyzed stopwords and threw out everything else (Mosteller and Wallace 1963).↩︎

  4. This section and the next lean heavily on chapters 1 and 3 in Silge and Robinson (2017).↩︎

  5. Based on examples from Clark (2018), Schweinberger (2021), and especially Niekler and Wiedemann (2020).↩︎

  6. These are English-language, it should be said. There are many for other languages out there, including an extensive Norwegian positive/negative sentiment dictionary. Some of these will involve a little wrangling of the data to read them into R and get them into tidy format.↩︎

  7. For some discussion of the nrc lexicon and more background on sentiment analysis, see chapter 20 in Jurafsky and Martin (forthcoming). One of the earliest and best known sentiment dictionaries is the General Inquirer.↩︎

  8. See chapter 13 in Wickham and Grolemund (2016) for great visualizations of joins and translation into R commands.↩︎

  9. There are simple statistics to test the correlation between words in the same document (the phi coefficient – in widyr as pairwise_cor()), statistics to measure differences in word use between documents (log-likelihood, chi-squared, keyness), and so on. These are a bit much to cover in a two-day workshop but are all fairly easy to implement in R, and you can easily find tutorials and explainers that will help you do this.↩︎



2022.